Skip to content

Conversation

@trentm
Copy link
Member

@trentm trentm commented Jul 18, 2025

The previous sorting of spans based on parentSpanId would fail if there
were 3 or more spans with (a) the same start time millisecond and
(b) had parent/child relationships.

This case was happening frequently for me in local testing of instr-redis-4.test.js

…Collector.sortedSpans

The previous sorting of spans based on parentSpanId would fail if there
were *3 or more* spans with (a) the same start time millisecond and
(b) had parent/child relationships.

This case was happening frequently for me in local testing of instr-redis-4.test.js
@trentm trentm self-assigned this Jul 18, 2025
@trentm trentm requested a review from a team as a code owner July 18, 2025 19:01
@trentm
Copy link
Member Author

trentm commented Jul 18, 2025

longer details

I noticed flaky test failures from test/instr-redis-4.test.js
(soon to become instr-redis.test.js in #889).
For example,

% npm run test-services:start redis
% NODE_OPTIONS='-r dotenv/config' DOTENV_CONFIG_PATH=./test/test-services.env npx tape test/instr-redis-4.test.js
...
# use-redis.mjs (ESM)
# running: (cd "/Users/trentm/el/elastic-otel-node11/packages/opentelemetry-node/test" && NODE_OPTIONS=--import=@elastic/opentelemetry-node node ./fixtures/use-redis.mjs)
# elapsed: 3.402s
# stdout:
# |{"name":"elastic-otel-node","level":30,"preamble":true,"distroVersion":"1.1.1","env":{"os":"darwin 24.5.0","arch":"arm64","runtime":"Node.js v20.19.1"},"config":{"logLevel":"info"},"msg":"start EDOT Node.js","time":"2025-07-18T18:49:04.019Z"}
# |key "test-f25b834d": "bar"
# stderr: <empty>
ok 34 exited successfully: err=null
not ok 35 should be strictly equal
  ---
    operator: equal
    expected: |-
      'redis-connect'
    actual: |-
      'manual'
    at: Object.checkTelemetry (/Users/trentm/el/elastic-otel-node11/packages/opentelemetry-node/test/instr-redis-4.test.js:67:15)
    stack: |-
      Error: should be strictly equal
          at Test.assert [as _assert] (/Users/trentm/el/elastic-otel-node11/packages/opentelemetry-node/node_modules/tape/lib/test.js:492:48)
          at Test.strictEqual (/Users/trentm/el/elastic-otel-node11/packages/opentelemetry-node/node_modules/tape/lib/test.js:670:7)
          at Object.checkTelemetry (/Users/trentm/el/elastic-otel-node11/packages/opentelemetry-node/test/instr-redis-4.test.js:67:15)
          at done (/Users/trentm/el/elastic-otel-node11/packages/opentelemetry-node/test/testutils.js:671:46)
          at ChildProcess.exithandler (node:child_process:414:7)
          at ChildProcess.emit (node:events:524:28)
          at maybeClose (node:internal/child_process:1104:16)
          at ChildProcess._handle.onexit (node:internal/child_process:304:5)
  ...
not ok 36 should be strictly equal
  ---
    operator: equal
    expected: 'redis'
    actual:   undefined
    at: Object.checkTelemetry (/Users/trentm/el/elastic-otel-node11/packages/opentelemetry-node/test/instr-redis-4.test.js:68:15)
    stack: |-
      Error: should be strictly equal
          at Test.assert [as _assert] (/Users/trentm/el/elastic-otel-node11/packages/opentelemetry-node/node_modules/tape/lib/test.js:492:48)
          at Test.strictEqual (/Users/trentm/el/elastic-otel-node11/packages/opentelemetry-node/node_modules/tape/lib/test.js:670:7)
          at Object.checkTelemetry (/Users/trentm/el/elastic-otel-node11/packages/opentelemetry-node/test/instr-redis-4.test.js:68:15)
          at done (/Users/trentm/el/elastic-otel-node11/packages/opentelemetry-node/test/testutils.js:671:46)
          at ChildProcess.exithandler (node:child_process:414:7)
          at ChildProcess.emit (node:events:524:28)
          at maybeClose (node:internal/child_process:1104:16)
          at ChildProcess._handle.onexit (node:internal/child_process:304:5)
  ...
ok 37 should be strictly equal
ok 38 should be strictly equal

1..38
# tests 38
# pass  36
# fail  2

It would fail in either or both of the two redis test fixtures.

The issue is that the sorted spans from:

const spans = filterOutDnsNetSpans(col.sortedSpans);

where not sorted by parentSpanId, despite the special code in TestCollector.sortedSpans to attempt to sort by parentSpanId.

It turns out that code was a little naive. Just considering the adjacent spans in the sort, i.e. only looking at a and b in spans.sort((a, b) => { ... }); is insufficient if there are 3 or more spans with the same start time. This is the case we are sometimes hitting, but it was masked by filterOutDnsNetSpans -- one of the 3 spans was a net span. For example:

  {
    traceId: '1294950a289847de83eabf7325ec951d',
    spanId: '3ba170202ce0735d',
    parentSpanId: '92193a68b9ee021c',
    name: 'tcp.connect',
    startTimeUnixNano: '1752862079619000000',
    endTimeUnixNano: '1752862079625546292'
  },
  {
    traceId: '1294950a289847de83eabf7325ec951d',
    spanId: '92193a68b9ee021c',
    parentSpanId: '429721a9d51ce368',
    name: 'redis-connect',
    startTimeUnixNano: '1752862079619000000',
    endTimeUnixNano: '1752862079636120000'
  },
  ...
  {
    traceId: '1294950a289847de83eabf7325ec951d',
    spanId: '429721a9d51ce368',
    parentSpanId: undefined,
    name: 'manual-parent-span',
    startTimeUnixNano: '1752862079619000000',
    endTimeUnixNano: '1752862079639812042'
  },

With the current algorithm, it results in this sort (i.e. it handles the parentSpanId for the first two, but not for third):

  {
    traceId: '1294950a289847de83eabf7325ec951d',
    spanId: '92193a68b9ee021c',
    parentSpanId: '429721a9d51ce368',
    name: 'redis-connect',
    startTimeUnixNano: '1752862079619000000',
    endTimeUnixNano: '1752862079636120000'
  },
  {
    traceId: '1294950a289847de83eabf7325ec951d',
    spanId: '3ba170202ce0735d',
    parentSpanId: '92193a68b9ee021c',
    name: 'tcp.connect',
    startTimeUnixNano: '1752862079619000000',
    endTimeUnixNano: '1752862079625546292'
  },
  {
    traceId: '1294950a289847de83eabf7325ec951d',
    spanId: '429721a9d51ce368',
    parentSpanId: undefined,
    name: 'manual-parent-span',
    startTimeUnixNano: '1752862079619000000',
    endTimeUnixNano: '1752862079639812042'
  },
  ...

This PR improves the algoritm to ensure any child span "starts" after its
parent, but artificially adding nanoseconds. If OTel JS spans get nanosecond
resolution at some point, this will be a bogus algorithm.

@trentm trentm merged commit fcf90a2 into main Jul 23, 2025
13 checks passed
@trentm trentm deleted the trentm-flaky-instr-redis-test branch July 23, 2025 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants